Characterization of Subgraphs Relationships and 
Distribution in Complex Networks 

Lucas Antiqueira and Luciano da Fontoura Costa

Instituto de Fisica de Sao Carlos, Universidade de Sao Paulo, 

Av. Trabalhador Sao Carlense 400, Caixa Postal 369, 

CEP 13560-970, Sao Carlos, Sao Paulo, Brazil 

A network can be analyzed at different topological scales, ranging from single nodes to motifs, 
communities, up to the complete structure. We propose a novel intermediate-level topological anal- 
ysis that considers non-overlapping subgraphs (connected components) and their interrelationships 
and distribution through the network. Though such subgraphs can be completely general, our 
methodology focuses on the cases in which the nodes of these subgraphs share some special feature, 
such as being critical for the proper operation of the network. Our methodology of subgraph char- 
acterization involves two main aspects: (i) a distance histogram containing the distances calculated 
between all subgraphs, and (ii) a merging algorithm, developed to progressively merge the subgraphs 
until the whole network is covered. The latter procedure complements the distance histogram by 
taking into account the nodes lying between subgraphs, as well as the relevance of these nodes to 
the overall interconnectivity. Experiments were carried out using four types of network models and 
four instances of real-world networks, in order to illustrate how subgraph characterization can help 
complement complex network-based studies. 

I. INTRODUCTION 

Because of their flexibility to represent, model and 
simulate virtually any discrete structure, complex net- 
works [il, [3, S, 0, [a] have been extensively studied 
and applied to the most diverse problems |6|], rang- 
ing from transportation (e.g. flights [3]) to commu- 
nications (e.g. Internet Q). Complex networks are 
'complex' because they exhibit particularly intricate 
and heterogeneous connectivity, e.g. by involving hubs 
or communities [10, 11]. As shown recently [13], 
most real-world complex networks also 
include in their structure regular patches of connectiv- 
ity, i.e. subgraphs whose nodes present similar topo- 
logical measurements. All in all, the heterogeneity of 
complex networks tends to range along several topo- 
logical scales, extending from the individual node level 
through mesoscopic structures such as modules and 
regular subgraphs, up to the whole network level. As 
a matter of fact, it is precisely the heterogeneous dis- 
tribution of structural features along the several scales 
which defines the intricate organization and most in- 
teresting structural and dynamical properties of com- 
plex networks. 

Although several works have investigated mesoscopic 
features of the connectivity of complex networks, 
e.g. by considering their respective communities 
[10, 11] and/or paths between different portions of 
the networks [ijjllal, few works (e.g. [ij,[ial) have 
focused on the study and characterization of the 
distribution of nodes and subgraphs within a given network. 
Such nodes and subgraphs of special interest arise in 
several situations, not only as communities or regular 
patches, but also with respect to extreme values of 
specific topological measurements. For instance, the 
nodes (or edges) with betweenness centrality higher 
than a given threshold can give rise to several sub- 
graphs inside a network. It should be observed that 
such subgraphs, investigated in this work, do not nec- 
essarily yield a partition of the original network, as 
typically they do not encompass all the original nodes. 
At the same time, these subgraphs are henceforth as- 
sumed to be connected components and not to overlap 
one another. 

Given a set of disjoint subgraphs of a network ex- 
pressing specific properties of interest, it becomes crit- 
ically important to characterize how these subgraphs 
are distributed through the network, as such information 
can be particularly important regarding the 
overall organization of the network and its dynamics. 
Going back to the above example with betweenness 
centrality, if highly central subgraphs are found to be 
close to one another (in terms of shortest path length 
between them), the portion of the original network 
containing such subgraphs can be understood as cor- 
responding to a critical bottleneck for the whole sys- 
tem under analysis. On the other hand, a more uni- 
form distribution of subgraphs with highest between- 
ness centrality suggests a system less critically struc- 
tured for communications. Several similar situations 
can be characterized with respect to other types of 
subgraphs, including other measurements as well as 
communities and regular patches. Yet, few works have 
addressed the specific issue of how critical nodes or 
subgraphs are distributed through the network topol- 
ogy. 

The objective of the present work is to develop and 
apply a comprehensive framework for characterization 
of the distribution of subgraphs of specific interest 
within a given complex network. In order to do so, 
we resort to the distances, quantified in terms of 
the shortest path lengths, between each pair of given 
subgraphs. Such distances are organized into a his- 
togram, which can provide valuable information about 
the topological distribution of the subgraphs. For 
instance, a sharp peak in such a histogram at a small 
value of distance will indicate that the subgraphs are 
all close to one another. Though the distances between 
subgraphs provide valuable information about their 
overall distribution, it is also interesting to have the 
means for progressively merging subgraphs in order to 
obtain connected components incorporating the crit- 
ical regions. Therefore, we also report an algorithm 
which allows the progressive merging of the subgraphs 
in terms of successive distance values, up to the point 
of containing all the given subgraphs. This merging 
is based on the morphological operation called dilation. 
Figure 1 depicts a network with four subgraphs 
and some bidirectional arrows which denote the distance 
and merging relationships we want to characterize. 
The potential of the distance histograms and 
subgraph merging algorithm is illustrated with respect 
to both theoretical and real-world complex 
networks, including the Barabasi-Albert model 1)\ as 
well as the power grid of the western states of the USA. 

FIG. 1: A graph containing four subgraphs whose nodes 
are highlighted. The approach reported in this article is 
aimed at characterizing the topological distribution and 
relationships between these subgraphs. 

This article starts by presenting the basic concepts 
and methods (Section II) and proceeds by describing 
the distance histogram and merging algorithm (Section III), 
which are then illustrated with respect to 
theoretical and real-world networks (Section IV). 



II. BASIC CONCEPTS AND METHODS 

A network can be represented by a graph $G(V,E)$, 
where $V = \{v_1, v_2, \ldots, v_N\}$ is its set of $N$ vertices (or 
nodes) and $E = \{e_1, e_2, \ldots, e_L\}$ is its set of $L$ edges (or 
links). An edge $e_k$ is a pair $(v_i, v_j)$ that represents a 
connection between nodes $v_i \in V$ and $v_j \in V$. The set 
of edges $E$ can be encoded into an adjacency matrix 
$A$, of dimension $N \times N$, with elements $A(i,j) = 1$ 
whenever there is an edge from node $i$ to node $j$, with 
$A(i,j) = 0$ being imposed otherwise. Notice that $A$ is 
symmetric for undirected graphs. 

In what follows we present the definitions of the 
adopted measurements for undirected graphs, as well 
as the details of the considered artificial and real-world 
networks. 



A. Network Measurements 

The network measurements reviewed in this section 
have been frequently employed in the field of complex 
networks. For more details, please refer to the review 
article [5]. 

a. Degree: The node degree $k(i)$ corresponds to 
the number of edges attached to a node $i$. Using the 
adjacency matrix $A$, the degree can be obtained by: 

$$k(i) = \sum_{j=1}^{N} A(i,j) \qquad (1)$$
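
As a quick illustration of Eq. (1), the following minimal sketch computes node degrees from an adjacency matrix stored as a list of lists. The function name `degrees` and the toy graph are assumptions made for this example only.

```python
def degrees(A):
    """Return k(i) = sum_j A[i][j] for every node i (Eq. 1)."""
    return [sum(row) for row in A]


# Toy undirected graph (illustrative assumption): a triangle 0-1-2
# plus a pendant node 3 attached to node 2.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

k = degrees(A)
mean_degree = sum(k) / len(k)  # the mean degree <k> of the toy graph
```

Because the matrix is symmetric for undirected graphs, summing rows or columns gives the same degrees.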



b. Clustering Coefficient: This measurement reflects 
the density of connections between the neighbors 
of a node $i$. Let: 

$$\eta(i) = \{ j \mid A(i,j) = 1,\ i \neq j \}$$

be the set of neighbors of $i$, and: 

$$e(i) = \sum_{u, v \in \eta(i),\ u < v} A(u,v)$$

be the number of edges between the neighbors of $i$. 
The clustering coefficient of a node $i$ is defined as: 

$$c(i) = \frac{2\,e(i)}{|\eta(i)|\,(|\eta(i)| - 1)}, \qquad (2)$$

where $|\eta(i)|$ is the cardinality of $\eta(i)$, i.e. it is the 
number of neighbors of $i$. 
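
The definition in Eq. (2) can be sketched as follows. The function name, the toy matrix and the convention of returning zero for nodes with fewer than two neighbors are assumptions of this example; the text above does not state how that degenerate case is handled.

```python
from itertools import combinations


def clustering(A, i):
    """Clustering coefficient c(i) of node i (Eq. 2) for an undirected 0/1 matrix A."""
    neighbors = [j for j in range(len(A)) if A[i][j] == 1 and j != i]
    n = len(neighbors)
    if n < 2:
        return 0.0  # assumption: c(i) is taken as 0 with fewer than two neighbors
    # e(i): each edge between two neighbors of i is counted once
    e = sum(A[u][v] for u, v in combinations(neighbors, 2))
    return 2.0 * e / (n * (n - 1))


# Toy graph (illustrative assumption): triangle 0-1-2 plus pendant node 3 on node 2.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
```

In the toy graph, node 2 has three neighbors with a single edge among them, so only one of the three possible neighbor pairs is connected.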

c. Length of Shortest Paths: The proximity between 
nodes is usually quantified in terms of shortest 
paths. A path $p(i,j)$ extending from node $i$ to node $j$ 
is denoted by a sequence of neighboring nodes: 

$$p(i,j) = (v_1, v_2, \ldots, v_l, v_{l+1})$$

where $A(v_k, v_{k+1}) = 1$ for $1 \le k \le l$, $v_1 = i$, $v_{l+1} = j$ and the length 
of the path is $\omega(p(i,j)) = l$. Notice that the length of 
a path is the number of edges along it. The length of 
the shortest path between two nodes $i$ and $j$ is thus 
given by: 

$$s(i,j) = \min \{ \omega(p(i,j)) \}, \qquad (3)$$

which is the minimum number of steps (edges) needed 
to reach node $j$ after starting at node $i$ (or vice versa 
in the case of undirected networks). 
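
For unweighted graphs, the shortest path lengths of Eq. (3) are commonly obtained with a breadth-first search. The sketch below is one such implementation, using an adjacency-list dictionary; the function name and the toy graph are assumptions of this example.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Return {node: s(source, node)} for every node reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:  # first visit gives the shortest distance
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


# Toy undirected graph (illustrative assumption): triangle 0-1-2 with a tail 2-3-4.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
lengths_from_0 = shortest_path_lengths(adj, 0)
```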

d. Betweenness Centrality: This measurement is 
closely related to the shortest path. Consider the set: 

$$\sigma(i,j) = \{ p(i,j) \mid \omega(p(i,j)) = s(i,j) \}$$

of all shortest paths between $i$ and $j$. Moreover, the 
set: 

$$\sigma(i,v,j) = \{ p(i,j) \mid p(i,j) \in \sigma(i,j) \text{ and } v \in p(i,j) \}$$

contains all the shortest paths between $i$ and $j$ that 
pass through node $v$. The betweenness centrality of a 
node $v$ is given as: 

$$b(v) = \sum_{i \neq j} \frac{|\sigma(i,v,j)|}{|\sigma(i,j)|}, \qquad (4)$$




which takes the sum over all possible pairs of distinct 
nodes i and j. Informally speaking, this centrality 
measurement quantifies the participation of v in min- 
imum paths. 
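
A direct, unoptimized reading of Eq. (4) is sketched below for small graphs. It counts shortest paths with a breadth-first search and uses the fact that the number of shortest paths from $i$ to $j$ through $v$ is the product of the counts from $i$ to $v$ and from $v$ to $j$ whenever $v$ lies on a shortest path. Two choices here are assumptions of this example rather than statements of the text: the endpoints $i$ and $j$ are not counted as intermediate nodes, and the sum runs over ordered pairs, so each unordered pair contributes twice in an undirected graph.

```python
from collections import deque


def bfs_counts(adj, source):
    """Distances from source and number of shortest paths to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:  # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness(adj):
    """b(v) as in Eq. (4), summed over ordered pairs i != j (endpoints excluded)."""
    nodes = list(adj)
    dist, sigma = {}, {}
    for s in nodes:
        dist[s], sigma[s] = bfs_counts(adj, s)
    b = {v: 0.0 for v in nodes}
    for i in nodes:
        for j in nodes:
            if i == j or j not in dist[i]:
                continue
            for v in nodes:
                if v == i or v == j or v not in dist[i] or j not in dist[v]:
                    continue
                if dist[i][v] + dist[v][j] == dist[i][j]:
                    b[v] += sigma[i][v] * sigma[v][j] / sigma[i][j]
    return b


# Toy graphs (illustrative assumptions): a 3-node path and a 4-node cycle.
path = {0: [1], 1: [0, 2], 2: [1]}
square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
```

This cubic-time version is only meant to mirror the definition; faster algorithms exist for large networks.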



B. Network Models 

Four theoretical network models were chosen in 
this work in order to construct undirected networks. 
For each model, 100 realizations (networks) were generated 
with $N = 1{,}000$ nodes and mean degree 
$\langle k \rangle = 6$. The general characteristics of these models 
are given below. 

a. Erdős-Rényi (ER): In this model, every possible 
pair of nodes $(i,j)$ is connected with uniform 
probability $p$ [18]. For an ER network, the mean node 
degree is given by $\langle k \rangle = p(N - 1)$ in the large network 
limit $N \to \infty$. Moreover, this model yields random 
networks with a Poisson degree distribution, which 
implies a characteristic mean degree, i.e. the node 
degrees do not greatly deviate from $\langle k \rangle$. 
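
The relation $\langle k \rangle = p(N - 1)$ gives the connection probability needed for a target mean degree. The sketch below generates one ER realization on that basis; the function name, the seed and the set-based adjacency structure are assumptions of this example. An ER realization is not guaranteed to be connected, so a connectivity check (not shown) would be needed before applying a method that assumes a connected graph.

```python
import random


def erdos_renyi(n, mean_degree, seed=None):
    """One ER realization with p chosen so that <k> = p (n - 1)."""
    rng = random.Random(seed)
    p = mean_degree / (n - 1)
    adj = {v: set() for v in range(n)}
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair is tested once
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj


adj = erdos_renyi(1000, 6, seed=42)
mean_k = sum(len(nb) for nb in adj.values()) / len(adj)  # close to 6 on average
```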

b. Watts-Strogatz (WS): The WS model generates 
networks exhibiting the small-world property, i.e. 
high average clustering coefficient and low average 
shortest path length [17]. In order to obtain a WS network, 
we start with a regular ring-shaped network with $N$ 
nodes, where every node is connected to its $k$ nearest 
neighbors in both directions. Then, each edge is 
moved (rewired) to another position with probability 
$p$. Depending on $p$, a WS network can approach the 
features of the initial regular network (for $p \to 0$) or 
of a random network (for $p \to 1$). In our experiments, 
we employed $p = 0.2$. Also notice that the mean node 
degree of a WS network is $\langle k \rangle = 2k$, where $k$ on the 
right-hand side is the number of nearest neighbors on each side of the ring. 

c. Barabási-Albert (BA): Networks with a 
power-law degree distribution can be obtained by 
considering the BA model Q. This type of network 
contains a few nodes, called hubs, concentrating many 
connections, while the majority of nodes have only a 
few links. A BA network is generated by adding new 
nodes to an initial network of $m_0$ nodes. Each newly 
added node is connected to m previous nodes, with 
the probability of connections being proportional to 
the respective degrees. The average node degree of a 
BA network is $\langle k \rangle = 2m$. 

d. Geographical (GG): In contrast to the ER, 
WS and BA models, a geographical model considers 
the spatial position of nodes to create edges Q , so that 
the spatial adjacency between nodes often strongly in- 
fluences the respective connectivity. In the geographi- 
cal model adopted here (called GG), randomly placed 
nodes are distributed through a bi-dimensional grid 
of size L X L, and edges are established among nodes 
geographically close to each other, i.e. separated by 
a distance not greater than $R$. Thus, long-range connections 
are not created, implying longer paths than 
in the previous models presented in this section. The 
mean degree of a GG network can be estimated as 
$\langle k \rangle \approx \pi R^2 N / L^2$. 
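
Solving the estimate above for the radius gives $R \approx L \sqrt{\langle k \rangle / (\pi N)}$, which can be used to pick $R$ for a target mean degree. The sketch below does this and builds one GG realization. Drawing node positions uniformly at random in a continuous square, the function names and the seed are assumptions of this example; nodes near the border have fewer neighbors, so the realized mean degree tends to fall somewhat below the target.

```python
import math
import random


def gg_radius(n, side, mean_degree):
    """Radius R solving <k> = pi R^2 N / L^2 for R."""
    return side * math.sqrt(mean_degree / (math.pi * n))


def geographical(n, side, mean_degree, seed=None):
    """One GG realization: nodes within distance R of each other are linked."""
    rng = random.Random(seed)
    radius = gg_radius(n, side, mean_degree)
    pos = [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n)]
    adj = {v: set() for v in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            if math.hypot(dx, dy) <= radius:
                adj[i].add(j)
                adj[j].add(i)
    return adj, pos, radius


adj, pos, radius = geographical(200, 1.0, 6, seed=1)
```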



C. Real-World Networks 

A set of real networks, as described below, has also 
been used in our experiments. These networks have 
been chosen so as to provide a representative sample 
of several types of real-world networks of general interest. 
Table I provides a quick reference with basic 
information about each network. 

a. Co-authorship in Network Science: This net- 
work, called NetScience, expresses the co-authorship 
relationships between scientists that published papers 
in the field of complex networks [19]. It was compiled 
by M.E.J. Newman in May 2006 [2J] using the refer- 
ences cited in two surveys of the field [3, [j| , plus some 
manually added references. Each scientist is a node 
in this network, while an undirected edge is created 
between two scientists whenever they have published 
at least one joint paper. NetScience has 1,589 nodes, 
of which 379 are inside the largest connected compo- 
nent, which is the part we used in our experiments. 
Henceforth, whenever we mention NetScience, we re- 
fer to its largest connected component. Moreover, we 
do not take into consideration the original weights of 
this network. 

b. Email Communications: We also considered a 
graph reflecting the flow of email messages exchanged 
among the members of the University Rovira i Virgili 
(Spain) [20]. This network, compiled in the research 
group of A. Arenas [25], has a single connected 
component, where each email address is identified by a 
node (there are $N = 1{,}133$ addresses), and a message 
sent from node $i$ to node $j$ is represented by an 
undirected unweighted edge $(i,j)$. The authors removed 
bulk emails, i.e. messages sent to more than 50 
addresses, before defining the edges in this network. 

c. Power Grid: This network represents the 
topology of the power grid of the western states of the 
USA [17], and was compiled by D. Watts and S. Strogatz 
[26]. A power grid is the structure that underlies 
the transmission of electricity from power plants to 
consumers. The power grid of the USA western states 
is a single connected component with 4,941 nodes 
interconnected by undirected and unweighted links. 

d. Internet-AS: The connections that associate 
ASs (Autonomous Systems) in the Internet were con- 
sidered by M.E.J. Newman in the compilation of this 
network [27]. An AS is a group of computer networks 
that share the same routing policy and have a cen- 
tralized administration. Using BGP (Border Gate- 
way Protocol) data of July 22, 2006, Newman recon- 
structed the links between 22,963 ASs, which yielded 
a connected graph with unweighted and undirected 
edges. Due to the nature of BGP, which is a gateway 
protocol used to route data packets between ASs, it 
was possible to retrieve information about the physi- 
cal links of the Internet at the AS level. 



TABLE I: Basic information about the real networks used 
in our experiments. For each network, we show its number 
of nodes $N$, number of edges $L$ and respective mean degree 
$\langle k \rangle$. 

Network        $N$       $L$       $\langle k \rangle$
NetScience     379       914       4.82
Email          1,133     5,451     9.62
Power Grid     4,941     6,594     2.67
Internet-AS    22,963    48,436    4.22



III. CHARACTERIZATION OF SUBGRAPHS 




The method for analyzing subgraph interconnectivity 
introduced in this section assumes that the 
graph/network $G(V,E)$ under study is undirected, 
unweighted and connected. It also requires that $C$ 
subgraphs $G_i = (V_i, E_i)$, $1 \le i \le C$, be defined such 
that: 

(i) $V_i \subset V$, 

(ii) $V_i \neq \emptyset$, 

(iii) $V_i \cap V_j = \emptyset$ for every $i \neq j$, 

(iv) $(v_i, v_j) \in E_i$ if and only if $v_i \in V_i$, $v_j \in V_i$ and 
$(v_i, v_j) \in E$, 

(v) $G_i$ is a connected component, 

(vi) Two different subgraphs $G_i, G_j$ are not direct 
neighbors. 

In order to create a subgraph $G_i$, it is enough to define 
a valid $V_i$, since $E_i$ must contain all the edges of $E$ 
that connect pairs of nodes included in $V_i$ (rule (iv)). 
We refer to a 'valid' $V_i$ as a non-empty subset of $V$ 
(rules (i) and (ii)) that results in a connected component 
(rule (v)) not sharing any nodes or edges with 
other subgraphs (rules (iii) and (vi)). Furthermore, 
the selection of nodes for subgraphs $G_i$ also depends 
on the specific study being carried out, provided that the 
above conditions are met. Figure 2 shows an example 
of a graph $G$ with $N = 64$ nodes and four subgraphs 
$G_1, \ldots, G_4$. 
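
One simple way to obtain node sets that satisfy rules (i) to (vi) is to take the connected components of the subgraph induced by a chosen set of nodes of interest: components are non-empty, disjoint, connected and, by construction, never adjacent to one another. The sketch below follows this idea; the function name and the toy graph are assumptions of this example, and the edge sets $E_i$ are left implicit since rule (iv) determines them from $V_i$.

```python
from collections import deque


def extract_subgraphs(adj, selected):
    """Node sets V_i: connected components of the subgraph induced by `selected`."""
    selected = set(selected)
    seen = set()
    components = []
    for start in sorted(selected):
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                # only walk through selected nodes, so components stay inside the set
                if w in selected and w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return components


# Toy path graph 0-1-2-3-4-5-6 (illustrative assumption).
adj = {v: [w for w in (v - 1, v + 1) if 0 <= w <= 6] for v in range(7)}
subgraphs = extract_subgraphs(adj, [0, 1, 3, 6])
```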



A. Distance Histogram 



FIG. 2: Graph $G = (V,E)$ with $N = 64$ nodes, where 
each node is identified by a number $v$. This graph has four 
subgraphs $G_1$, $G_2$, $G_3$ and $G_4$, whose sets of nodes are $V_1 = 
\{6,7,8,15,16,23\}$, $V_2 = \{18,19,20,27,28,29,37\}$, $V_3 = 
\{26,33,34\}$ and $V_4 = \{48,55,56,64\}$. These subgraphs 
correspond to the four connected components with black 
nodes. 



One way to analyze subgraph interrelationship is 
by computing the distance between every pair of 
subgraphs. We define this distance to correspond to the 
length of the shortest path between subgraphs and, 
therefore, Eq. (3) can be used for this purpose. Notice 
that there are at least $|V_i||V_j|$ different paths between 
two different subgraphs $G_i$ and $G_j$, because each path 
may start at any node of the source subgraph and end 
at any node of the destination subgraph. The length 
of the shortest path between two different subgraphs 
can now be defined as: 

$$s(G_i, G_j) = \min_{w_i \in V_i,\ w_j \in V_j} \{ s(w_i, w_j) \} - 1, \qquad (5)$$

imposing that $s(G_i, G_j) = 0$ whenever $i = j$. Notice 
that the length of the shortest path is decremented by 
one, which changes the distance from an 
edge-oriented to a node-oriented one. This modification 
has been done because there must be at least 
one node, or two edges, between a pair of subgraphs 
(i.e. the distance would start at two). Thus, a 
node-oriented distance is preferred because it is more 
intuitive. We therefore define the matrix $D_s$ of order 
$C \times C$ with elements $D_s(i,j) = s(G_i, G_j)$, i.e. it 
encodes every distance between all $C$ subgraphs of $G$. 
As an example, the matrix $D_s$ for the four subgraphs 
in Figure 2 is given as: 

[matrix $D_s$ not recoverable from the source] 

where $D_s$ is symmetric because $G$ is an undirected 
graph. If these distances (excluding the diagonal of 
$D_s$) are placed in a histogram, the overall proximity 
between subgraphs can be examined more easily than 
by just observing $D_s$, as will become clearer in the 
experiments reported in Section IV. 
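
The subgraph distance of Eq. (5) and the resulting histogram can be sketched as below. A breadth-first search started simultaneously from all nodes of one subgraph yields the minimum over its nodes in a single pass. The function names and the toy graph are assumptions of this example, and counting each unordered pair of subgraphs once in the histogram is also an assumption, since the matrix is symmetric.

```python
from collections import Counter, deque


def distances_from_set(adj, sources):
    """Shortest path length from the nearest node of `sources` to every node."""
    dist = {v: 0 for v in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def subgraph_distance_matrix(adj, subgraphs):
    """C x C matrix with s(G_i, G_j) of Eq. (5); the diagonal is zero."""
    C = len(subgraphs)
    D = [[0] * C for _ in range(C)]
    for i, Vi in enumerate(subgraphs):
        dist = distances_from_set(adj, Vi)
        for j, Vj in enumerate(subgraphs):
            if i != j:
                D[i][j] = min(dist[w] for w in Vj) - 1  # node-oriented distance
    return D


def distance_histogram(D):
    """Histogram of off-diagonal distances, one count per unordered pair."""
    C = len(D)
    return Counter(D[i][j] for i in range(C) for j in range(i + 1, C))


# Toy path graph 0-1-2-3-4-5-6 with three subgraphs (illustrative assumption).
adj = {v: [w for w in (v - 1, v + 1) if 0 <= w <= 6] for v in range(7)}
subgraphs = [{0, 1}, {3}, {6}]
D = subgraph_distance_matrix(adj, subgraphs)
hist = distance_histogram(D)
```

In the toy path, a single node (node 2) separates the first two subgraphs, which is why their node-oriented distance equals one.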



B. Subgraph Merging 

The method detailed in this section aims at grad- 
ually merging subgraphs Gi inside the original graph 
G, while giving special attention to the relationship 
between them. We implement this progressive merg- 
ing, or expansion, in terms of the gradual growth of 
the subgraphs Gi towards graph G, which is accom- 
plished by adding to the subgraphs Gi nodes of G 
that do not belong to any Gi yet (and also adding the 
necessary edges, as specified in the definition of the 
subgraphs Gi presented earlier in this paper). 

In order to achieve a gradual subgraph merging, 
some vertices of $G$ need to be included in the expansion 
earlier than others. In our methodology, higher 
relevance is given to the nodes inside a short path 
connecting some pair of different subgraphs $(G_i, G_j)$. In 
this manner, the merging is controlled by the length 
of paths between subgraphs. More specifically, for a 
node $v$ outside every subgraph $G_i$, we compute the 
length of the shortest path, among those between all pairs of 
different subgraphs $(G_i, G_j)$, that necessarily passes through 
node $v$. This is understood as the relevance of node 
$v$ in the merging of subgraphs. In other words, a node 
that is close to only one subgraph is considered a member 
of a weak tie, because it does not take part in short 
paths connecting that subgraph with others. 

The aforementioned merging can be carried out by 
applying consecutive dilations [ij, |2l| in the 
subgraphs of $G$. The dilation is a morphological 
operation $\delta(g)$, defined over a subgraph $g$ of $G$, that yields 
another subgraph that is equal to the union of the 
original subgraph and its neighbors in $G$ (plus the 
respective edges). Figure 3 illustrates the dilation of the 
subgraph $G_1$ of Figure 2, which is formed by vertices 
$V_1 = \{6, 7, 8, 15, 16, 23\}$. The dilation $\delta(G_1)$ results in 
a subgraph with nodes $V_1 \cup \{14, 22, 24\}$ (i.e. nodes 
$\{14, 22, 24\}$ are neighbors of $G_1$), along with the 
respective edges that connect these nodes in $G$. 
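
At the level of node sets, a single dilation amounts to adding all neighbors of the current nodes. The sketch below shows this plain operation on a toy graph; the function name and the graph are assumptions of this example, and edges are left implicit because they are induced by the node set.

```python
def dilate(adj, nodes):
    """One dilation: the node set together with all of its neighbors in the graph."""
    nodes = set(nodes)
    return nodes | {w for u in nodes for w in adj[u]}


# Toy path graph 0-1-2-3-4-5-6 (illustrative assumption).
adj = {v: [w for w in (v - 1, v + 1) if 0 <= w <= 6] for v in range(7)}

once = dilate(adj, {3})
twice = dilate(adj, once)  # applying the operation again grows the set by one more step
```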



FIG. 3: Dilation $\delta(G_1)$ of the subgraph $G_1$ of Figure 2. 
$G_1$ has nodes $V_1 = \{6, 7, 8, 15, 16, 23\}$, and the dilation is 
formed by nodes $V_1 \cup \{14, 22, 24\}$, represented in black in 
the figure. 

Dilations are employed as an intermediate step in 
our method. When $\delta(G_i)$ is applied sequentially and 
recursively inside $G$, until no more dilations are 
possible, a distance map is traced between the subgraph 
$G_i$ and the other nodes of $G$. This recursive dilation 
is denoted by: 

$$\delta_d(G_i) = \underbrace{\delta(\delta(\ldots\delta(G_i)\ldots))}_{d \text{ times}}, \qquad (6)$$

and the nodes that are included in $\delta_d(G_i)$, but not 
in $\delta_{d-1}(G_i)$, are said to be at distance $d$, in number 
of edges, from $G_i$. For $d = 0$, the recursive dilation 
is defined as $\delta_0(G_i) = G_i$, and because the dilation 
$\delta_{-1}(G_i)$ is not possible, the nodes inside $G_i$ are 
naturally defined to be at distance 0 from $G_i$. It is worth 
pointing out that these distances are different from 
those given in the previous section (from matrix $D_s$), 
which are only calculated between subgraphs, not 
between a subgraph and every node of the network. 

Since we are going to dilate all $C$ subgraphs, it is 
necessary to apply the dilation $\delta_d(G_i)$ without 
considering the nodes of other subgraphs $G_j$, $i \neq j$. This 
particular behavior is required because the merging 
proceeds outwards from the set of subgraphs, and thus it is 
not necessary to consider nodes that already belong 
to a subgraph. Therefore, when dilating a subgraph 
$G_i$, some other subgraph $G_j$ may block the accessibility 
of $G_i$ to some nodes in the graph. For example, in 
Figure 2 subgraph $G_2$ cannot communicate with nodes 
25 and 41 because subgraph $G_3$ is blocking its access 
to these nodes. In fact, in this example, only subgraph 
$G_3$ is able to communicate with nodes 25 and 41, or, 
conversely, these nodes can only access subgraph $G_3$. 
In what follows, we define the set of subgraphs 
accessible from a node $v$ as: 

$$Q(v) = \{ G_i \mid v \in \delta_d(G_i), \text{ for } 0 \le d < \infty \}, \qquad (7)$$

and the total number of subgraphs accessible from a 
node $v$ is: 

$$q(v) = |Q(v)|, \qquad (8)$$

where $|Q(v)|$ is the cardinality of $Q(v)$. Notice that 
$q(v) \ge 1$ for any $v$, due to the fact that $G$ is a 
connected graph. 

The complete set of dilations $\delta_d(G_i)$, which takes 
into account every subgraph $G_i$ and every possible 
dilation starting from $d = 1$ (until there are no more 
nodes to be added by the dilation), allows the 
definition of a distance matrix $D_\delta$, of order $N \times C$. 
An element $D_\delta(v,i)$ of $D_\delta$ indicates the distance 
between node $v$ and subgraph $G_i$. As observed before, 
this distance is given by the dilation in number of 
edges. There is an exception requiring special 
treatment: for a node $v$ not accessible from a subgraph $G_i$, 
i.e. $G_i \notin Q(v)$, we define $D_\delta(v,i) = d_{\max} + 1$, where 
$d_{\max} = N - 1$ is the maximum possible distance 
between two nodes in a graph with $N$ nodes. 
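
The node-to-subgraph distances can be sketched with one breadth-first search per subgraph in which nodes belonging to any subgraph are never entered, which reproduces the blocking behavior of the recursive dilations. Inaccessible entries keep the default value $d_{\max} + 1 = N$. The function name, the dictionary-of-lists layout and the toy graph are assumptions of this example.

```python
from collections import deque


def dilation_distance_matrix(adj, subgraphs):
    """D[v][i]: distance from node v to subgraph i by blocked dilation; N if inaccessible."""
    n = len(adj)
    owner = {v: i for i, Vi in enumerate(subgraphs) for v in Vi}
    D = {v: [n] * len(subgraphs) for v in adj}  # default is d_max + 1 = N
    for i, Vi in enumerate(subgraphs):
        dist = {v: 0 for v in Vi}
        queue = deque(Vi)
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                # nodes of other subgraphs block the growth; own nodes are already in dist
                if w in dist or w in owner:
                    continue
                dist[w] = dist[u] + 1
                queue.append(w)
        for v, d in dist.items():
            D[v][i] = d
    return D


# Toy path graph 0-1-2-3-4-5-6 with three subgraphs (illustrative assumption).
adj = {v: [w for w in (v - 1, v + 1) if 0 <= w <= 6] for v in range(7)}
subgraphs = [{0, 1}, {3}, {6}]
D = dilation_distance_matrix(adj, subgraphs)
q = {v: sum(1 for d in row if d < len(adj)) for v, row in D.items()}  # q(v) of Eq. (8)
```

In the toy path, the middle subgraph blocks the first one from reaching nodes 4 and 5, so those entries stay at the default value.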

The definition of the matrix $D_\delta$ is the last step 
before specifying a relevance value for each vertex. The 
length of the shortest path between any two subgraphs $G_i$ and $G_j$, 
$i \neq j$, that necessarily passes through node $v$, is then 
defined as the relevance $r(v)$ of node $v$. Consequently, 
the lower $r(v)$, the higher is the relevance of node $v$ 
in the merging of the subgraphs. More formally, $r(v)$ 
is given by: 

$$r(v) = \begin{cases} \min_{1 \le i,j \le C,\ i \neq j} \{ D_\delta(v,i) + D_\delta(v,j) \} - 1, & \text{if } q(v) > 1 \\ \min_{1 \le i \le C} \{ D_\delta(v,i) \}, & \text{if } q(v) = 1 \end{cases} \qquad (9)$$

where the second case is an exception that occurs 
when node $v$ can access only one subgraph (or it is 
inside a subgraph), thus $r(v)$ is equal to $D_\delta(v,i)$, where 
$i$ is the index of the only subgraph to which $v$ is 
connected. In this case, $r(v)$ is solely controlled by the 
consecutive dilations of a single subgraph. Observe 
also that, in the first case, the minimum expression 
is decreased by one. This decreasing scheme was 
included because otherwise $r(v)$ would always be greater 
than one for nodes with $q(v) > 1$, an odd behavior for 
a quantity that starts at zero. In other words, the 
relevance was changed from an edge-oriented to a 
vertex-oriented one, similarly to what was done with 
matrix $D_s$ in the computation of the distance histogram 
(Section III A). Therefore, Eq. (9) allows relevance 
values greater than or equal to one for every node outside 
subgraphs $G_i$, while zero relevance is reserved for the 
nodes inside some subgraph. 
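
Given one row of the node-to-subgraph distance matrix, Eq. (9) reduces to adding the two smallest accessible distances and subtracting one, or to taking the single accessible distance when only one subgraph can be reached. The sketch below implements this reading; the function name, the use of $N$ as the marker for inaccessible subgraphs and the hand-written toy matrix are assumptions of this example.

```python
def relevance(row, n):
    """r(v) of Eq. (9) from one row of the distance matrix; entries equal to n are inaccessible."""
    accessible = sorted(d for d in row if d < n)
    if len(accessible) > 1:  # q(v) > 1: two nearest distinct subgraphs
        return accessible[0] + accessible[1] - 1
    return accessible[0]  # q(v) = 1: only one subgraph can be reached


# Distance matrix of a toy 7-node path with subgraphs {0, 1}, {3} and {6}
# (illustrative assumption; 7 marks an inaccessible subgraph).
n = 7
D = {
    0: [0, 7, 7], 1: [0, 7, 7], 2: [1, 1, 7], 3: [7, 0, 7],
    4: [7, 1, 2], 5: [7, 2, 1], 6: [7, 7, 0],
}
r = {v: relevance(row, n) for v, row in D.items()}
```

Nodes inside a subgraph obtain zero, and node 2, which lies directly between two subgraphs, obtains the smallest non-zero value.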

Figure 4 shows $r(v)$ for every node of the graph 
illustrated in Figure 2. In this example, $r(v)$ is shown both 
numerically and graphically. The latter approach uses 
a gray scale proportional to $r(v)$, ranging from black 
(when $r(v) = 0$, which refers to the most relevant 
nodes) to white (when $r(v) = 8$, representing the least 
relevant nodes in this example). Notice that the 
darkest nodes are placed in the shorter paths that connect 
subgraphs $G_1, \ldots, G_4$. The nodes of these subgraphs 
correspond to those with $r(v) = 0$. 




FIG. 4: Values of $r(v)$ for each node in the graph $G$ of 
Figure 2. The number next to each node denotes $r(v)$ 
(please refer to Figure 2 for the number $v$ of each node). 
The color of each node is derived from a gray scale ranging 
from black ($r(v) = 0$) to white ($r(v) = 8$). 

Finally, the merging of the subgraphs of $G$ is 
performed by thresholding the relevance values as follows: 

$$V^+ = \{ v \mid r(v) \le T \}, \qquad (10)$$

where $T \ge 0$ is an integer threshold. In addition, $C^+$ 
subgraphs $G_i^+ = (V_i^+, E_i^+)$, $1 \le i \le C^+$, are created 
such that: 

(i) $V_1^+ \cup V_2^+ \cup \ldots \cup V_{C^+}^+ = V^+$, 

(ii) $V_i^+ \cap V_j^+ = \emptyset$ for every $i \neq j$, 

(iii) $(v_i, v_j) \in E_i^+$ if and only if $v_i \in V_i^+$, $v_j \in V_i^+$ 
and $(v_i, v_j) \in E$, 

(iv) $G_i^+$ is a connected component, 

(v) Two different subgraphs $G_i^+, G_j^+$ are not direct 
neighbors. 

These rules are similar to those given in the definition 
of the original subgraphs $G_i$, with the difference that 
the merged subgraphs are restricted to the vertices 
belonging to the thresholded set $V^+$. To summarize 
the process of merging, it suffices to take into account 
that the new subgraphs are the connected components 
that remain when nodes $v \notin V^+$ (and their edges) are 
excluded from $G$. 
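
Following the summary above, the merged subgraphs for a given threshold can be sketched as the connected components that remain after discarding every node whose relevance exceeds $T$. The function name, the toy graph and the hand-written relevance values are assumptions of this example.

```python
from collections import deque


def merge_subgraphs(adj, r, T):
    """Node sets of the connected components induced by V+ = {v : r[v] <= T} (Eq. 10)."""
    v_plus = {v for v in adj if r[v] <= T}
    seen, components = set(), []
    for start in sorted(v_plus):
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w in v_plus and w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return components


# Toy path graph 0-1-2-3-4-5-6 and relevance values for subgraphs {0, 1}, {3}, {6}
# (illustrative assumptions).
adj = {v: [w for w in (v - 1, v + 1) if 0 <= w <= 6] for v in range(7)}
r = {0: 0, 1: 0, 2: 1, 3: 0, 4: 2, 5: 2, 6: 0}
```

With $T = 0$ the original subgraphs are returned unchanged, and raising the threshold joins them step by step.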

Thresholding with $T = 0$ gives the original 
subgraphs, as $r(v)$ is equal to zero if and only if $v$ belongs 
to some subgraph $G_i$. Since greater thresholds include 
other nodes, two or more subgraphs can then be joined 
into one single connected component. An example of 
a merging for a threshold $T = 2$ is given in Figure 5, 
where the input for the merging is the graph $G$ and 
its subgraphs depicted in Figure 2. Figure 5 shows in 
black the nodes that belong to the merging, which 
results in a single connected component joining all four 
subgraphs of $G$. 

It is worth pointing out that recursive dilations 
could be used as a method of subgraph merging (and 
not only as an intermediate step), where each 
subgraph would be simultaneously dilated until the 
entire network was covered. Nevertheless, Figure 3 
shows that the first dilation of the subgraph $G_1$ 
includes nodes 14 and 24, whereas our merging does not 
include any of them until $T = 3$ (see Figure 4). 
Although nodes 14 and 24 are neighbors of $G_1$, they are 
farther from other subgraphs, and thus they do not 
participate in short paths linking a pair of subgraphs. 
This comparison shows that dilations do not 
discriminate the relative position of nodes between subgraphs, 
and this is the reason why we have defined the matrix 
of distances $D_\delta$ and the relevance values $r(v)$. 




FIG. 5: Merging $G^+$ inside $G$ (see Figure 2) for a threshold 
$T = 2$. The value next to each node is its relevance $r(v)$, 
and the nodes of the subgraph merging are those shown in 
black. 

If the merging is computed for sequentially 
increasing thresholds starting at $T = 0$, $V^+$ grows until 
$V^+ = V$, i.e. the subgraphs $G_i$ expand until they 
form a single connected subgraph that is equal to $G$, 
which we call gradual merging. The number of 
subgraphs (or connected components) $C^+$ in the 
merging can be monitored until the end of the sequential 
thresholding, when $C^+$ must be equal to one. In this 
case, $C^+$ is a monotonically nonincreasing function of 
$T$. 

In a gradual merging, it is possible to verify the 
overall proximity of the original subgraphs by 
observing how fast $C^+$ drops to one, thus complementing the 
distance histograms explained in Section III A. 
Furthermore, the number of nodes in $V^+$, for sequentially 
increasing thresholds, is useful to assess the overall 
relevance $r(v)$ of nodes and also to measure how many 
nodes are necessary to bring together all the original 
subgraphs. In order to properly explain the utilization 
of gradual expansion and to illustrate its potential to 
complement the distance histogram, we give in the 
next section examples of subgraph characterization in 
both real-world and artificial networks. 
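
The two quantities monitored during a gradual merging, namely the number of connected components and the number of nodes in the thresholded set, can be sketched as a simple sweep over thresholds. The function names, the toy graph and the relevance values are assumptions of this example.

```python
def count_components(adj, nodes):
    """Number of connected components of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen, count = set(), 0
    for start in nodes:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w in nodes and w not in seen:
                    seen.add(w)
                    stack.append(w)
    return count


def gradual_merging(adj, r):
    """List of (T, C+, |V+|) for T = 0 up to the largest relevance value."""
    curve = []
    for T in range(max(r.values()) + 1):
        v_plus = [v for v in adj if r[v] <= T]
        curve.append((T, count_components(adj, v_plus), len(v_plus)))
    return curve


# Toy path graph 0-1-2-3-4-5-6 and relevance values for subgraphs {0, 1}, {3}, {6}
# (illustrative assumptions).
adj = {v: [w for w in (v - 1, v + 1) if 0 <= w <= 6] for v in range(7)}
r = {0: 0, 1: 0, 2: 1, 3: 0, 4: 2, 5: 2, 6: 0}
curve = gradual_merging(adj, r)
```

A curve in which the component count drops to one at a small threshold indicates subgraphs that are close to one another.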



IV. EXPERIMENTAL RESULTS AND 
DISCUSSION 

We now illustrate the application of subgraph 
characterization to a set of artificial and real-world 
networks. As already mentioned in Section II B, 100 
realizations of models ER, WS, BA and GG were 
obtained, each one with $N = 1{,}000$ nodes and mean 
degree $\langle k \rangle = 6$. The real networks were introduced in 
Section II C, namely NetScience, Email, Power Grid 
and Internet-AS, with $N$ ranging from 379 to 22,963 
and $\langle k \rangle$ between 2.67 and 9.62 (please refer to Table I 
for more details about real networks). The chosen 
networks cover a considerable range of types of struc- 
tures usually studied in the field of complex networks 
[ll, S S 0, [^] , therefore providing a representative ba- 
sis for the illustration of our method. 

An important step is the definition of the subgraphs to be analyzed. To
perform this task, we have chosen the measurements (i) betweenness
centrality b and (ii) clustering coefficient c (both explained in a previous
section). More specifically, the nodes with the highest b or c were chosen
to form subgraphs G_i according to the definition presented in Section III.
Thus, subgraphs G_i correspond to the connected components (sometimes
containing only one node) formed by the nodes with the highest betweenness
centrality (or clustering coefficient), limited to 2.5% of the total number
of nodes. The use of betweenness centrality, already mentioned in the
introductory section of this paper, is particularly important as far as the
proximity between groups of critical nodes is concerned. In other words, if
these subgraphs are close to each other, the network may become particularly
sensitive to directed attacks on central nodes. Subgraphs with high
clustering coefficient are also interesting to analyze because they tend to
be more cohesive than others, i.e. they show a variety of different paths
between their nodes. The characterization of the connectivity between dense
subgraphs may lead to the improvement of search and transport strategies and
also of network designs. It is important to observe that the analysis of the
overall distribution of the critical subgraphs through the network can
provide information complementary to the already important insights that
those measurements traditionally provide at the local topological level.
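
The selection step described above can be sketched as follows. This is only a minimal illustration, assuming the measurement (b or c) has already been computed for every node; the function name `top_fraction_subgraphs`, the adjacency-dictionary representation and the tie-breaking rule are assumptions made for the example and are not taken from the paper.

```python
from collections import deque

def top_fraction_subgraphs(adj, score, fraction=0.025):
    """Return the connected components induced by the top-scoring nodes.

    adj:   dict mapping each node to the set of its neighbours (undirected).
    score: dict mapping each node to a precomputed measurement (b or c).
    """
    n_selected = max(1, round(fraction * len(adj)))
    # Highest scores first; ties broken by node label (an arbitrary choice).
    ranked = sorted(adj, key=lambda v: (-score[v], v))
    selected = set(ranked[:n_selected])

    components, seen = [], set()
    for start in ranked[:n_selected]:
        if start in seen:
            continue
        # Breadth-first search restricted to the selected nodes.
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w in selected and w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return components
```

Each returned set plays the role of one subgraph G_i; a set with a single element corresponds to the unitary subgraphs mentioned in the text.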

In the next subsections we report the results ob- 
tained for the aforementioned artificial and real-world 
networks regarding distance histograms and subgraph 
merging. Since 100 realizations of each network model 
have been performed, the following results are pre- 
sented in terms of average measurements and respec- 
tive standard deviations. Notice that for the real net- 
works this procedure was not necessary because only 
one network was available for each case. 
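
As an illustration of how the per-model averages could be computed, the sketch below averages a list of histograms bin by bin. It assumes that each realization yields a dictionary of relative frequencies indexed by distance, that a bin absent from a realization counts as zero, and that the population standard deviation is used; none of these choices is stated in the paper, and `average_histograms` is a hypothetical helper.

```python
from statistics import mean, pstdev

def average_histograms(histograms):
    """Average relative-frequency histograms bin by bin.

    histograms: list of dicts {distance: relative frequency}, one per
                realization of a network model.
    Returns {distance: (mean, standard deviation)}.
    """
    bins = sorted(set().union(*histograms))
    summary = {}
    for d in bins:
        # A distance that never occurs in a realization contributes 0.
        values = [h.get(d, 0.0) for h in histograms]
        summary[d] = (mean(values), pstdev(values))
    return summary
```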



A. Network Models 

Figure 6 shows the average distance histograms for the subgraphs consisting
of nodes with the highest betweenness centrality in ER, WS, BA and GG
networks. Models ER and WS have well-defined peaks at distance 2 and 3,
respectively. Moreover, distance values do not greatly deviate from the
respective peaks in both histograms, which shows that groups of central
nodes tend to be close to one another in these two models. Interestingly,
the BA model shows a peak at distance 0, i.e. subgraphs are likely to be in
the same connected component in this model. Thus, ER, WS and BA models have
central subgraphs close to one another, although to different degrees. This
"central region" plays an important role in a network if we consider that
the betweenness centrality measurement reflects well the importance of nodes
in dynamical processes taking place in the network. For instance, diseases
(or news) may spread fast in a social network (typically well-modeled by WS)
if the first infected (or informed) people are inside the central region.
Procedures to stop epidemics may also be more successful if applied mostly
at the central region. If we consider transport processes, the central
region needs to deal with considerably higher traffic than the rest of the
network and also needs to have stronger security policies against attacks,
otherwise a critical bottleneck may arise. A very different behavior is
shown by the model GG, with distance values as high as 35. In general,
subgraphs in geographical networks are more likely to be 1-10 nodes apart,
with lower probabilities for higher distances. Although the histogram for
the GG model is not uniform, subgraphs consisting of nodes with high
betweenness centrality tend to be scattered over GG networks. Thus, these
networks do not have a main bottleneck since traffic would be decentralized.
Moreover, spreading processes would also be slower in GG networks than in
networks of ER, WS and BA types.
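
A distance histogram of the kind discussed above could be computed along the following lines. The sketch assumes, purely for illustration, that the distance between two subgraphs is the smallest hop count between any of their nodes; the paper's own definition is given in Section III and may differ (for instance by counting intermediate nodes). The case of a single subgraph, which the text reports as distance 0, is not handled here. Both function names are hypothetical.

```python
from collections import Counter, deque

def distances_from(adj, sources):
    """Multi-source BFS: hop distance from the node set `sources`."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def subgraph_distance_histogram(adj, subgraphs):
    """Relative frequency of the distances between all pairs of subgraphs."""
    counts = Counter()
    for i, g in enumerate(subgraphs):
        dist = distances_from(adj, g)
        for h in subgraphs[i + 1:]:
            reachable = [dist[v] for v in h if v in dist]
            if reachable:  # skip pairs lying in different network components
                counts[min(reachable)] += 1
    total = sum(counts.values())
    return {d: c / total for d, c in sorted(counts.items())} if total else {}
```

One breadth-first search is run per subgraph rather than per node, which keeps the cost manageable when only 2.5% of the nodes are selected.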

The histograms reproduced in Figure 7 were obtained by taking nodes with
high clustering coefficient as references. These subgraphs are farther from
each other than the subgraphs created using betweenness centrality when the
same network models are compared. For instance, the models ER and WS had
their distance peaks increased from 2 to 3 and from 3 to 5, respectively,
when comparing the histograms of Figure 6 with those of Figure 7.
Nevertheless, these changes are relatively small when considering the total
number of nodes in these networks (N = 1,000). Subgraphs in BA networks now
tend to be 2 or 3 units of distance apart, rather than being all connected
as in the previous BA histogram. Even so, clustered subgraphs can be
considered close to each other in models ER, WS and BA, given the much
larger size of the networks when compared to the values in the distance
histograms. More significant increases in subgraph distances were observed
in the geographical model, where subgraphs were found to be as far as 60
nodes apart. Now, a peak around 20-30 was found in the distance histogram of
the GG model, with a slow decay for higher distance values. Clustered
subgraphs can be regarded as groups of nodes with redundancy of connections,
since many paths exist between two nodes in the same cluster. Thus, ER, WS
and BA networks show link redundancy at nearby locations, which can be an
undesirable bottleneck for transport networks in case of problems with this
"clustered region". Moreover, random failures happening inside regions of
low clustering would isolate some nodes that do not have connection
redundancy around them. GG networks, on the other hand, have their clustered
subgraphs more dispersed, implying that connectivity redundancy is not
concentrated in a single region of the network. We argue here that GG
networks would then be more tolerant to localized attacks since clustered
subgraphs are spread all over the network.

FIG. 6: Average distance histograms and respective standard deviations
considering 100 realizations of each network model (ER, WS, BA and GG). The
subgraphs were created using the nodes (2.5% of N) with the highest
betweenness centrality.

We now turn our attention to the gradual merging of subgraphs. Our approach
consists in monitoring the number of subgraphs C+ and the number of nodes
|V+| in the gradual merging as the threshold T increases. The plots in the
left column of Figures 8 and 9 show, for each network model, the number of
subgraphs C+ as a function of the merging threshold T, while the right
column shows the number of nodes |V+| as a function of T. Both quantities
were divided by the total number of nodes N in the network, thus normalizing
their range throughout this paper. Observe also that when the C+ curve
stabilizes, its absolute value becomes equal to one.
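
The two monitored quantities can be obtained from the set of nodes covered by the merging at a given threshold, as in the sketch below. It only evaluates the normalized values of C+ and |V+| for a node set supplied by the caller; how that set grows with T is determined by the merging algorithm of Section III, which is not reproduced here. The name `merging_state` is hypothetical.

```python
from collections import deque

def merging_state(adj, merged_nodes):
    """Return (C+/N, |V+|/N) for the node set currently covered by the merging.

    adj:          dict mapping each node to the set of its neighbours.
    merged_nodes: iterable with the nodes included in the merging so far.
    """
    merged = set(merged_nodes)
    seen, n_components = set(), 0
    for start in merged:
        if start in seen:
            continue
        # Each unseen start node opens a new connected component.
        n_components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w in merged and w not in seen:
                    seen.add(w)
                    queue.append(w)
    n = len(adj)
    return n_components / n, len(merged) / n
```

When all subgraphs have been joined, the first returned value equals 1/N, which corresponds to the absolute value of one mentioned above.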

Figure 8 shows the results for network models with subgraphs derived from
the betweenness centrality measurement. The plots obtained for the ER model
(first line of Figure 8) indicate that a threshold T = 1 is capable of
joining almost all subgraphs using approximately |V+| = 5% of the nodes in
the network, including the nodes inside subgraphs. These results show that
subgraphs with central nodes tend to be close to one another in ER networks,
a feature already noticed in the analysis of distance histograms.
Nevertheless, with the gradual merging we are able to identify which (and
how many) nodes are more relevant while joining all subgraphs. Thus, only 5%
of the nodes in ER networks are enough to group their central subgraphs in
one connected component, reinforcing the idea of a "central region"
introduced in the beginning of this subsection. WS networks show similar
results, where all subgraphs are merged when T = 2 using approximately
|V+| = 10% of network nodes. The central region is more prominent in BA
networks because nodes with high betweenness centrality tend to form a
single connected component from the onset of the gradual merging (which, by
the way, behaves simply as a dilation in this case). Geographical networks,
on the other hand, need a threshold T = 13 to connect all subgraphs,
encompassing approximately |V+| = 45% of network nodes. Indeed, as already
observed in this subsection, subgraphs consisting of central nodes are
likely to be distant from each other in GG networks, hence the high
threshold and |V+| necessary to connect all subgraphs.

FIG. 7: Average distance histograms and respective standard deviations
considering 100 realizations of each network model. The subgraphs were
created using the nodes (2.5% of N) with the highest clustering coefficient.

Figure 9 shows the number of subgraphs C+ and the number of nodes |V+| in
the gradual merging of subgraphs with high clustering coefficient. The left
panel of this figure shows that the chosen subgraphs tend to be composed of
single nodes, especially in BA networks (notice that the maximum number of
subgraphs in this case is 2.5% of N, when all subgraphs are unitary).
Moreover, for each network model the threshold necessary to join all
subgraphs is consistently higher than that observed in Figure 8, showing
again that clustered subgraphs are farther from each other than central
subgraphs. More detailed features of Figure 9 were also noticed: ER and BA
networks have C+ = 1 when T = 2 (C+ drops faster in BA than in ER), using
less than 10% of network nodes. Subgraphs in WS networks are all merged when
T = 4, with more than 20% of network nodes included in the merging.
Geographical networks show again the most distinct results: a threshold
T ≈ 25 is necessary to bring together all subgraphs, when almost the entire
network (more than 90% of the nodes) is included in the merging.

The results reported in this subsection show a remarkable difference between
the GG model and the other three models. Geographical networks demand
considerably higher thresholds to merge all subgraphs, possibly a
consequence of the distance parameter that controls the creation of edges.
The BA model shows a distinctive feature for highly central subgraphs: they
are, in fact, a single connected component that groups all nodes with high
betweenness centrality. These nodes are likely to be the hubs, since
betweenness centrality and degree were shown to be highly correlated in BA
networks [22]. On the other hand, highly clustered nodes are apart from each
other in the BA model, which indicates that these nodes tend to lie near the
periphery of BA networks. ER and WS networks show similar results, although
in the WS model subgraphs are consistently farther away from each other than
in the ER model. Notice that both models have low average shortest paths;
nevertheless, the WS model has a higher average clustering coefficient than
the ER model, which may explain the observed differences in their results.



B. Real-World Networks

The distance histograms for real networks considering subgraphs derived from
the betweenness centrality measurement are shown in Figure 10. Three
networks (NetScience, Email and Internet-AS) have central nodes contained in
the same subgraph, thus only distance 0 is counted in their histograms. This
observation may indicate a weakness in the Internet's architecture: although
the Internet is not centrally controlled, some important autonomous systems
(according to the betweenness centrality) are physically connected to each
other, thus forming a central group of nodes that may compromise the entire
network if attacked. The NetScience and Email networks show the same
distance distribution, which means that a small group of people plays an
important role by concentrating a considerable amount of
knowledge/information flow among nodes. Subgraphs in the Power Grid network
are slightly more separated from each other. Nevertheless, almost 70% of its
subgraph distances are equal to 1, indicating that the majority of central
subgraphs are close to each other in this network, which may be considered a
security flaw in the Power Grid.

As for models ER, WS, BA and GG (Section IV A), real networks have clustered
nodes more scattered over the network than central nodes (Figure 11 shows
the distance histograms for subgraphs based on high clustering coefficient).
At one extreme is the Power Grid, with subgraphs at most 33 units of
distance apart and peaks at distances 9 and 18. This fact suggests a good
tolerance to random failures, since the network has redundancy of
connections spread over the network. At the opposite extreme is the
Internet, with all clustered subgraphs near one another (at most at distance
2). We argue here that autonomous systems should have more distributed
clustered subgraphs in order to avoid bottlenecks at regions far away from
the observed clustered regions. The NetScience and Email networks have
distance peaks at 5 and 3, respectively, indicating that clusters of
collaborators/acquaintances are not too distant from each other in these
networks.

Figure 12 depicts the gradual merging for the real networks with subgraphs
constructed using the betweenness centrality measurement. As already
mentioned in this section, NetScience, Email and Internet networks have one
single central subgraph, i.e. C+ = 1 for any merging threshold. Thus, their
merging acts as a dilation with a steep increase of |V+|. The Power Grid has
only a few subgraphs joined at T = 1, using 5% of network nodes.
Nevertheless, the entire Power Grid network is only covered by the merging
when T ≈ 55, which is in accordance with the slow growth of |V+| observed
for geographical networks in Section IV A.
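
The dilation behavior mentioned above, which arises when there is a single central subgraph, can be illustrated with a generic graph dilation. In this sketch each step adds every node adjacent to the set covered so far; treating one step as one unit of the threshold T is an assumption of the example rather than a statement about the merging algorithm, and `dilation_curve` is a hypothetical name.

```python
def dilation_curve(adj, seed, steps):
    """Fraction of nodes covered after 0, 1, ..., `steps` dilations of `seed`.

    adj:  dict mapping each node to the set of its neighbours.
    seed: iterable with the nodes of the initial subgraph.
    """
    covered = set(seed)
    frontier = set(seed)
    n = len(adj)
    curve = [len(covered) / n]
    for _ in range(steps):
        # Nodes adjacent to the current frontier that are not yet covered.
        frontier = {w for u in frontier for w in adj[u]} - covered
        covered |= frontier
        curve.append(len(covered) / n)
    return curve
```

In a network with short average paths the returned curve rises steeply, whereas in a geographically constrained network it grows slowly, which matches the qualitative contrast described in the text.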

Rather different results were found for subgraphs based on nodes with high
clustering coefficient (see Figure 13). Highly clustered nodes tend to be
separated from each other in these real networks, as already observed for
models ER, WS, BA and GG (i.e. C+ → 2.5% when T = 0, especially for the
Email, Power Grid and Internet networks, showing that almost every subgraph
is composed of a single node). Remarkably, less than 5% of Internet nodes
are capable of joining all its subgraphs at T = 2. Subgraphs in the Email
network are also quickly joined (at T = 3), although demanding almost 35% of
its nodes. NetScience maintains its subgraphs separated until T = 5, when
almost 25% of its nodes are merged. These observations show that, although
Internet, Email and NetScience have subgraphs quickly merged, they are more
cohesive in the Internet because far fewer nodes are necessary to bring them
together in a single connected component. The Power Grid only joins its
subgraphs when T = 9, which is again in accordance with previous results
(Section IV A) that show a slow merging in GG networks.

Although the Power Grid can be regarded as a geographical network, it was
previously associated with the WS model because it shows the small-world
effect [13]. Nevertheless, in the experiments reported here the Power Grid
behaves rather differently from the WS model. The Internet, although
geographically constrained, shows results very different from those of the
GG model in all experiments we have performed. In fact, the Internet is
better associated with the BA model, since it shows a power-law
degree distribution Q. Indeed, the Internet at the 
autonomous system level shows results similar to the ones obtained for the
BA model: central nodes are all connected in only one subgraph and clustered
nodes are more scattered over the network. NetScience and Email networks
also present a behavior similar to that observed in the BA model. NetScience
can indeed be associated with the BA model, as power-law degree
distributions were observed in scientific collaboration networks [23],
although the Email network deviates from this model by having an exponential
distribution of degrees [20].



V. CONCLUDING REMARKS 

In this paper we presented a framework for charac- 
terizing the distribution of critical subgraphs in com- 
plex networks. We adopted distance histograms to 
assess the overall relationship between subgraphs and 
also developed an algorithm to sequentially merge 
subgraphs according to a metric of node relevance. 
The merging approach complements the distance his- 
togram by identifying which (and how many) nodes 
are necessary to join two or more subgraphs in the 
same connected component. 

Rather than characterizing single nodes exclusively, the proposed framework
operates at a higher topological level by analyzing groups of nodes and
their interconnectivity. Closely related topological levels have been the
focus of many network-based studies, such as the analysis of communities and
motifs. Nevertheless, unlike community detection, the method proposed in
this paper does not create a partition of the network, and it also does not
identify small subgraph patterns (i.e. motifs). Our main motivation is to
analyze the interconnectivity and dispersion of similar (according to any
desired criteria) groups of nodes independently of their size.

We illustrated our method by analyzing critical subgraphs with respect to
both theoretical and real-world networks. Subgraphs comprising nodes with
high betweenness centrality were found to be close to one another in the
Erdős-Rényi, Watts-Strogatz and Barabási-Albert models, as well as in the
following real-world networks: Email, NetScience and Internet-AS. All these
networks also presented clustered subgraphs (i.e. with nodes with high
clustering coefficient) close to each other, although a bit farther apart
than the subgraphs based on the centrality measurement. Furthermore, both
types of subgraphs were found to be more distant from one another in the
Geographical model and also in the Power Grid network. The experimental
findings reported in this paper contribute to a better understanding of the
structure of the aforementioned networks, allowing us to draw some
conclusions about dynamical processes taking place on networks.

Further work may focus on similar analyses using other networks, as well as
different types of subgraphs. Another interesting investigation would be to
expand the proposed framework using hierarchical/concentric measurements
[12], thus allowing the analysis of subgraph neighborhood at different
hierarchical levels.

FIG. 13: Number of subgraphs C+ (top plot) and number of nodes |V+| (bottom
plot) in the gradual merging performed for the real networks. The subgraphs
were created using the nodes (2.5% of N) with the highest clustering
coefficient.